Estimating the population exposed to a risk factor over a time window: A microsimulation modelling approach from the WHO/ILO Joint Estimates of the Work-related Burden of Disease and Injury

Objectives Burden of disease estimation commonly requires estimates of the population exposed to a risk factor over a time window (yeart to yeart+n). We present a microsimulation modelling approach for producing such estimates and apply it to calculate the population exposed to long working hours for one country (Italy). Methods We developed a three-model approach: Model 1, a multilevel model, estimates exposure to the risk factor at the first year of the time window (yeart). Model 2, a regression model, estimates transition probabilities between exposure categories during the time window (yeart to yeart+n). Model 3, a microsimulation model, estimates the exposed population over the time window, using the Monte Carlo method. The microsimulation is carried out in three steps: (a) a representative synthetic population is initiated in the first year of the time window using prevalence estimates from Model 1, (b) the exposed population is simulated over the time window using the transition probabilities from Model 2; and (c) the population is censored for deaths during the time window. Results We estimated the population exposed to long working hours (i.e. 41–48, 49–54 and ≥55 hours/week) over a 10-year time window (2002–11) in Italy. We populated all three models with official data from Labour Force Surveys, United Nations population estimates and World Health Organization life tables. Estimates were produced of populations exposed over the time window, disaggregated by sex and 5-year age group. Conclusions Our modelling approach for estimating the population exposed to a risk factor over a time window is simple, versatile, and flexible. It however requires longitudinal exposure data and Model 3 (the microsimulation model) is stochastic. The approach can improve accuracy and transparency in exposure and burden of disease estimations. To improve the approach, a logical next step is changing Model 3 to a deterministic microsimulation method, such as modelling of microflows.


Introduction
The World Health Organization (WHO) and the International Labour Organization (ILO), supported by a large number of individual experts [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20], produce estimates of exposure to selected occupational risk factors and, consecutively, of the work-related burden of disease attributable to these exposures (i.e. the burden of disease that can be attributed to exposure to a particular risk factor in the past) [21][22][23][24]. These estimates are called the WHO/ILO Joint Estimates of the Work-related Burden of Disease and Injury (WHO/ILO Joint Estimates; see https://www.who.int/teams/environment-climate-change-and-health/monitoring/who-ilojoint-estimates). A crucial step when producing the WHO/ILO Joint Estimates, as with estimating other burden of disease attributable to exposure to a risk factor, is estimating the prevalence of exposure to this risk factor over a time window (Fig 1). A time window is defined as the "period in which an exposure can have adverse or protective effects on development and subsequent disease outcome" [25]. Improved accuracy in estimating exposure prevalence leads to more accurate burden of disease estimation. WHO and ILO have developed tools and approaches to use in systematic reviews of prevalence of exposure [26][27][28]. In this article, we describe a novel modelling approach from the WHO/ILO Joint Estimates, which can be used for producing estimates for all burden of disease studies, but here, we focus on work-related We seek to estimate the burden of disease (such as the number of deaths or disability-adjusted live years lost) at year a that is attributable to past exposure to a risk factor. For this, we require estimates of the number of persons exposed to an exposure category over the time window (year t to year t+n ). The known or assumed lag time is year t+n/2 to year a and we place the time window around the lag year. We then seek to estimate the number of persons exposed to the category during the time window. https://doi.org/10.1371/journal.pone.0278507.g001

PLOS ONE
Estimating the population exposed to a risk factor over a time window (or occupational) burden of disease where possible. The approach was developed after WHO and ILO identified the need for improved modelling of exposure over time for the production of official estimates of exposure to occupational risk factors, taking into consideration transitions between exposure categories over time.
Two main different methods exist for estimating prevalence of exposure over a time window (S2 Box). First, the Global Burden of Disease Study used occupation as a proxy for exposure to some occupational risk factors [29][30][31][32]. The prevalence of exposure to an occupational risk factor in the lag year (according to the lag time i.e. the time period between exposure to the risk factor and the occurrence of the health outcome) was multiplied by a constant factor of 4, the occupational turnover (OT; i.e. the ratio of the prevalence of a category of exposure to the risk factor at one year divided by the prevalence of the same exposure category over a time window), to estimate the prevalence of exposure over a time window. This OT rate did not vary over time, nor by cohort defined by sex and/or age group. Longitudinal data are not used.
Second, in the United Kingdom Burden of Cancers Study, staff turnover factors were estimated using new starters in years in the time window [33,34]. This method estimated starters in the past year as a proportion of the average number of people employed. It presented turnover rates of 4 and 6 for short-latency (20 years) and long-latency (40 years) risk exposure periods, respectively, meaning that prevalence of exposure in a year taking into account turnover is estimated to be 4 times larger (for short latency) and 6 times larger (for long latency) than the point estimate of the exposure in the same year. This method still makes limited use of longitudinal data.
In this article, we propose a new method to estimate the prevalence of exposure to an occupational risk factor over a time window, as input data for burden of disease estimates. This method makes more use of longitudinal data to model the movement of the target population between exposure categories. We describe the methods using an example with three exposure categories (41-48, 49-54, and �55 hours/week) for the occupational risk factor of exposure to long working hours over a 10-year window for Italy, using data from the European Union-Labour Force Survey (EU-LFS). We discuss the implications of our findings in relation to previous methods, including the gain in precision it provides, as well as next steps to further improve our approach.

Materials and methods
We developed a modeling approach based on three consecutive models. We produced two sets of estimates (Models 1 and 2 in Fig 2) as input data for the microsimulation model. First, the prevalence of exposure to an occupational risk factor was estimated as input data at the first year of the time window, year t (Model 1). Second, we modeled transition probabilities (i.e. the probability of changing the level of exposure between the exposure categories) for exposure categories during the time window, year t to year t+n (Model 2). Third, we developed a Monte Carlo method (MCM) based microsimulation model to estimate the exposed population over the time window based on these transition probabilities (Model 3). The methods and the data sources used in each of three models are described in detail below.

Model 1: Estimate of prevalence of exposure to the risk factor for the first year in time window
We estimated the prevalence of risk factor exposure for each exposure category for the first year in the time window using an established multilevel model that predicted the prevalence over time and geographic location [35]. The multilevel model is of the form: where proportion i is the proportion of the population in the exposure category i; year is the survey year; and sex, age and country are the sex, age group and country of the population. The intercept A i (sex, age, country) and slope B i (sex, age, country) of the year dependence of proportion i are calculated with a multilevel model, with sex and age group as fixed effects, and sex and age group as random effects, nested in the country. (Note that we modelled age group as a categorical variable, not a continuous variable as the database we used for this model only included age group categories.) We assumed that residuals follow a normal distribution (see S3 Table for R code).

Model 2: Estimate of transition probabilities of exposure categories over the time window
We estimated average transition probabilities between exposure categories by running a weighted multinomial logit regression model including sex and age as a fractional polynomial as regressors, using a matched sample of longitudinal data. From the coefficients estimated, we derived predicted probabilities of transitioning between exposure categories by cohorts defined by sex and age group. Including age as a continuous variable borrowed strength from the distribution of age to estimate age groups with limited numbers of observations. Thus, the final estimates, particularly for the highest and lowest age groups, are to some extent driven by the choice of function. This methodology builds on the approach developed by Kiiver & Espelage [36]. The multinomial logit regression model is of the form: where β j is the set of regression coefficients describing the longitudinal weights associated with the transition j; X j is a set of explanatory variables (sex and age as a fractional polynomial with maximal permitted degree of 2 associated with the transition j); and the summation (indexα) goes through all possible transitions j (except the transition from i = 0 in year t to i = 0 in year t +1 , which was chosen as a pivot outcome). Including age as a continuous variable allows us to borrow strength from the distribution of age to estimate age groups with limited numbers of observations. Thus, particularly for the highest and lowest age groups, the final estimates are to some extent driven by the choice of function. This methodology builds on the approach developed by Kiiver and Espelage [36] (in the cross sectional data we model, however, age group categories are predefined).

Model 3: Estimate of the exposed population over the time window using a microsimulation model
Each individual in the synthetic population possesses a set of attributes, which may be updated at discrete time steps. In particular, individuals are often defined as belonging to one of a finite number of mutually exclusive and collectively exhaustive states, and events of interest are modelled as transitions from one state to another that occur according to a set of deterministic and/or stochastic rules governed by transition probabilities. To derive proportion k for exposure category k, the microsimulation takes the following form: where the summation runs through individuals "alive" in the estimation year; δ k is the Kronecker delta function; S l is the sequence of all exposure categories the l th individual was assigned to in each year in the time window; and where max denotes that the highest exposure category that the l th individual experiences in the sequence. We produced estimates of the number of exposed population over the time window in three steps (A- C Fig 2, Model 3), which are detailed below.

Step A: Initialize the synthetic cohort at the first year (year t ) of the time window (year t to year t+n )
We initialized the synthetic cohort at the first year (year t ) of the time window. The individuals in this cohort were set to be representative of the distribution of sex, age and the exposure category in the general population for the country. We determined the size of the synthetic cohort, and its sex and age composition, using the size, sex and age distributions from the UN estimates for the national population [37]. The initial exposure distribution is obtained from the estimated prevalence of the exposure categories (an output from the multilevel model, Model 1).
Step B: Update the cohort over year t+1 to year t+n and censor the cohort for deaths during the time window Mutually exclusive and collectively exhaustive state of exposure category bands are defined during simulation; as individuals are simulated over time, each year the individual's age is increased by 1-year steps. Following the principles of MCM, at each step, the exposure category band is reassigned stochastically based on exposure category h i , transition probabilities h i, j , and a computer-generated uniform random number p from (0-1) (all unique to the individual): the (0-1) interval was split into six parts defined by P k j¼0 h i;j at k = 0..5 and compared with p. Exposure category h i was changed to the matching h k . At each year step, the alive state was consistently reassigned for each individual in a similar manner.
Step C: Censor the cohort for deaths after the time window At each year-step, the alive state was reassigned for each individual in a similar manner as the exposure category detailed in the previous paragraph. Here, however, the state of alive individuals only changed to dead if p was smaller than the probability of death, defined by sex and the actual age and year extracted from WHO standard life tables [38]. Those who survived until after the time window (i.e. to year a ) formed the synthetic cohort and were further analyzed.

Assessing the sensitivity of the modelling approach
The sensitivity of the modelling approach was tested by calculating the median relative error estimates for various input data uncertainty terms and their combinations using tornado plots.

Example
We applied the modelling approach to estimate the population exposed to long working hours

Data
The cross-sectional data for Model 1 were derived from the Italian EU-LFS, for the years 1983 to 2018. A detailed description of the cross-sectional EU LFS is provided elsewhere [40]. The bands of exposure to working hour h 0 −h 5 are i = 0: labour-force inactive; 1: 0-34 hours/week; 2: 35-40 hours/week; 3: 41-48 hours/week; 4: 49-54 hours/week; 5: �55 hours/week. S1 Table presents data sources, processes and steps of input variables.
The transition probability estimates are based on pseudo-longitudinal, reweighted EU-LFS microdata for Italy for the years 2010/11 to 2017/18 that are currently not accessible publicly. For rules governing access to anonymized EU-LFS microdata, see https://ec.europa.eu/ eurostat/web/microdata [41]. The analysis was restricted to people aged 15 years and older. The pseudo-longitudinal data were derived from the Italian EU-LFS by matching the data from the annually overlapping samples, averaging over quarters per year, for the years 2010/ 2011 to 2017/2018. In the absence of personal identifiers, matching was based on household number, household sequence number, sex, and year of birth [42]. This pseudo-longitudinal sample was reweighted to match target year margins for labour market status of the cross-sectional data, by sex and 10-year age group. Summary statistics on the input data can be found in the respective references and in S2 Table.

Models
Using Model 1, we estimated the prevalence of exposure at the first year of time window (year t = 2002). All cohorts aged �15 years (as defined above) were followed. Estimates were calculated for 34 such cohorts, defined by sex (female, male) and 5-year age group (17 categories: 15-19, 20-24, . . ., 90-94, �95 years), e.g., women aged 15-19 years. In this article, for simplicity and comprehension, we present estimates for cohorts defined by sex only. The computer software R was used to produce exposure prevalence for the first year in time window (S3 Table for R code). In the microsimulation model (Model 3), these estimates were used as the first set of input data (Input Data 1).
Model 2, a regression model, was used to estimate transition probabilities between the six categories of exposure h 0 . . .h 5 (as defined in the data section). Fractional polynomials with automated model selection in STATA 14 were used to model age for each separate regression run for the six exposure categories (i.e., h 0 . . .h 5 ). The transition between these bands is captured through 36 transition probabilities, 30 of which are independent. A single set of transition probabilities per cohort is used for all years in the time window (year t = 2000 to year t+n = 2011).
Model 3, a microsimulation model, was used to estimate the exposed population over the time window in three steps: (a) estimates of prevalence for exposure categories for the first year in the time window (Input Data 1 produced in Model 1); (b) transition probabilities for exposure categories over the entire time window (Input Data 2 produced in Model 2); and (c) number of persons by sex and age at first year of time window, accounting for deaths during the time window. Estimates were derived for the 22 cohorts defined above. The synthetic cohort size was set in the model to follow the age and sex distribution in 2002 and to sum to a total count of n = 200,000 to reduce the inherent random noise of MCM. At the time window's first year, the proportions in exposure categories were initialized to match the point estimates of Model 1 for that year. Transition probabilities obtained from Model 2 were selected according to the actual age of the cohort as it moved through the time window. Termination of follow-up due to death was updated each year according to year, age and sex of the cohort. Part A of Fig 3 shows the possible transitions of a synthetic individual from year t to year t+1 .
We calculated 95% uncertainty ranges (URs) using bootstrapping [43]. Estimates were produced 1000 times with starting parameters sampled independently from normal distributions with the mean equal to the corresponding point estimate and URs taken from those of the prevalence estimates per year. The 50%, 2.5% and 97.5% quantiles of the resulting random deviates of the exposures were then calculated and assigned to provide the point estimate, and the lower and upper limits of the 95% URs respectively.

Prevalence of exposure to long working hours, 2002 (Model 1 output)
In both sexes, 0.5 to 9.0% people were exposed to working hours category of 41-48 h/week, 0.2 to 4.6% were exposed to 49-54 h/week, and 0.2 to 3.4% were exposed to �55 h/week across age groups. The corresponding figures for females were 0.2 to 4.4%, 0.1 to 1.7% and 0.1 to 1.3%, and for males 1.0 to 13.6%, 0.3 to 7.5% and 0.3 to 5.6%, respectively (see S4 Table for full breakdown by sex and age group). In general, middle-aged men had higher prevalence of long working hours.

Transition probabilities over the time window (Model 2 output)
The transition probabilities (Table 1) show that the largest probability is to remain in the initial working hours bands. The most mobile working hours band is h 4 (worked 49-54 hours/week), where strong out movement is present towards both shorter and longer working hours. The least mobile band is h 0 (labour market inactive), because in the youngest (15)(16)(17)(18)(19) and oldest (65+) cohorts the largest probability transition is h 00 ; which is expected as the student or pensioner contributions are significant, respectively.

Prevalence of exposed population per year in time window (Model 3 output)
A small fraction of a synthetic cohort generated by Model 3 is shown in Part B of From the full set of the generated synthetic cohorts, prevalence of exposures were defined. Fig 4 shows the prevalence of exposure to at each exposure category in 2016 relative to the prevalence of exposure to at each respective category over the period of 2002-2011. In both sexes, the prevalence of exposure per year varies from 0 to 16.5%, 0 to 14.5%, and 0 to 15.7% for working hours categories 41-48, 49-54, and �55 h/week, respectively. The corresponding figures for females were 0 to 11.8%, 0 to 8.1% and 0 to 7.2%, and for males 0 to 21.3, 0 to 21.1 and 0 to 24.9% (see S5 Table for a full breakdown by sex and age-group). In general, middleaged men had higher prevalences.

Assessing model sensitivity
The sensitivity of the modelling approach was evaluated by calculating median relative error estimates for various input data uncertainty terms and their combinations using tornado plots. Compared to deterministic approaches, the MCM inherently produces random fluctuations of estimates due to its stochastic nature. This numerical noise is solely determined by the size of the synthetic cohort chosen by the modeler. Accordingly, it can be virtually eliminated by an appropriately large synthetic cohort. The fluctuation of the MCM exposure outcomes scale is ffi ffi ffi n 2 p where n is the size of the synthetic cohort, as is expected from the law of large numbers (S1A Fig). This observed scaling verifies that the MCM method is numerically robust and there is no loss of power. The URs of the final ("period") prevalence estimates depend on the URs of the input datasets. To quantify dependence, we repeated the MCM calculations assuming all possible combinations of input data uncertainty. The median relative fluctuation of exposure estimates is presented in S1A variables are presented in terms of relative magnitude of the estimated variance of the estimate (S1B Fig). The highest uncertainty was observed for the working hours data and the working hours transitions (median relative error 5%), followed by working hours data and working hours transitions and WHO life table (<5%). The least uncertainty was for the WHO life table (median relative error <0.05%).

Discussion
This paper describes a novel computational model for estimating exposure to a risk factor over a time window and demonstrates its application for producing official estimates of exposure to long working hours for a work-related burden of disease study (i.e., the WHO/ILO Joint Estimates). This microsimulation-based method has a number of advantages over other methods. Microsimulation models the individuals in the population and is flexible in how the results can be aggregated. It allows assessment of variations in impact with respect to socio-demographic and geographical parameters. Other applications demonstrating the flexibility of the MCMs in burden of disease estimation are, for example, the DYNAMO-HIA model [44] with which exposure to a risk factor is back-calculated from a known burden of disease (i.e., the inverse process to the method we present here) or the ATHLOS-Mic model [45] with which transition probabilities are estimated used MCMs. The Global Burden of Disease Study [29] and the United Kingdom Burden of Cancer Study [33,34] used constant values to estimate prevalence of exposure over a time window. It is important to note, however, that estimates derived using different methods cannot be made equal with a single multiplication factor, OT, for all exposure categories and cohorts simultaneously (Fig 4). This is particularly important when specific age groups carry proportionally large burden of disease, for example the case for 55 year and older males in the recent WHO/ ILO Joint Estimates of the Work-related Burden of Disease and Injury [24]. The OT approach loses the accuracy of different exposure categories and is applicable only to binary exposure models. In principle, it is possible to define different OTs for different exposure categories and, by that, effectively model different exposure categories independently. The MCM method, on the other hand, can handle arbitrary complex exposure category relations without increasing the model's complexity. We find that the OT was equal to 4.8 for the entire working-age population (i.e., aged 15 years and older in 2016) exposed to working �55 hours/week, whereas previous methods used OTs of 4 [29][30][31][32] and 6 [33,34], respectively. Previous methods that used an OT equal to 4, for example, would have underestimated the exposed population by more than 90,000 persons. Moreover, the age-and sex-disaggregated OTs, as estimated using our novel method, vary from 0 to about 20. This indicates a significant gain in precision for estimates disaggregated by age and sex, and that a single scalar multiplicative factor is less favorable.

Strength and weaknesses
The MCM microsimulation method is a versatile, flexible framework to perform arbitrary complex exposure estimation. It is possible to define exposure according to several different criteria: as described in detail here, it can be defined by the highest exposure category, but other possibilities include the highest exposure category in which at least two consecutive years were spent, or the highest exposure category in which the most time was spent. The model is also useful to produce cumulative exposure estimates. Using input data on occupational risk factor exposures from Occupational Safety and Health modules of labour force surveys, for example, this method provides the potential to estimate exposure to various such risk factors. However, it is limited by the quality of the input data and availability of longitudinal data. If longitudinal labour force surveys are unavailable for a country, data from other countries with a similar economic and social structure could be used as a proxy (with the limitations of this acknowledged). The use of the probabilistic Monte Carlo method for the microsimulation in Model 3 is likely less efficient than use of a deterministic method. The model cannot be applied to continuous exposure data. Also, our modelling assumes the same probability of transitioning from present state to the next state (i.e., exposure in year t was purely calculated based on exposure in year t-1 ), independent of previous states (i.e., the entire exposure regimen, such as exposure in year t could be influenced by exposure in year t-1 , year t-2 , etc), which could lead to an overestimate or an underestimate or have no impact on the estimates. It was not possible to model the entire exposure regimen with the data available to us, as the EU-LFS uses a rotating panel design, so individuals are not followed continuously over a period of time. Future research would be needed to test the assumption that transitions are independent of exposure history and, if proven false, microsimulation would need to be developed that takes into consideration full exposure regimen. To improve the approach by overcoming the stochastic nature of the Monte Carlo based microsimulation method, a logical next step is changing Model 3 to a deterministic microsimulation method, such as modelling of microflows [46,47], which we expect to provide gains in model efficiency.

Application of the estimates of exposed population
To produce occupational burden of disease estimates, accurate estimates are required of the proportion of the population who are exposed to occupational risk factors. Our method has potential to increase accuracy in occupational burden of disease studies. It has already been applied to produce WHO/ILO Joint Estimates of the Work-related Burden of Disease and Injury [21,24]. The models could also be used for other risk factors (e.g., environmental risk factors, such as air pollution, water sanitation and hygiene, and chemicals) and with some modifications to model cumulative exposure, such as cumulative dose (in units of, for example, fibres/m 3 or mg/m 3 ) of occupational exposure to asbestos and silica.

Conclusion
As part of the WHO/ILO Joint Estimates, we developed a new approach for modelling populations exposed over a time window using detailed transitions probabilities from longitudinal surveys as input data produced by official statistical offices. The method is a versatile and flexible method with good performance. This approach adds accuracy in the modelling of such exposed population numbers and can help advance estimates of the work-related burden of disease. To improve this modelling approach further, we propose a deterministic microsimulation method is next pursued, such as microsimulation of flows.
Supporting information S1 Fig. a. Median relative fluctuation of exposure estimates as a function of the inverse square root of the synthetic cohort population (symbols). Line is a scaling-law expected from the law of large numbers. b. The median relative error of exposure estimates (%) due to various input data uncertainty terms and their combination. (DOCX) S1